Optimal Adaptive Learning in Uncontrolled Restless Bandit Problems
Abstract
In this paper we consider the problem of learning the optimal policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each of which yields a positive reward when pulled. A player sequentially selects one of the arms at each time step, with the goal of maximizing its undiscounted reward over a time horizon T. The reward process of each arm is a finite-state Markov chain whose transition probabilities are unknown to the player. The state transitions of each arm are independent of the player's selections. We propose a learning algorithm with logarithmic regret uniformly over time with respect to the optimal finite horizon policy. Our results extend the optimal adaptive learning of MDPs to POMDPs.
Index Terms: Online learning, restless bandits, POMDPs, regret, exploration-exploitation tradeoff
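The problem setting described in the abstract can be illustrated with a minimal simulation sketch. This is not the paper's learning algorithm; it only models the environment, under stated assumptions. The names `step_chain` and `simulate`, the `policy(t, history)` signature, and the choice to start every arm in state 0 are all hypothetical and do not come from the paper.

```python
import random

def step_chain(state, P, rng):
    """Sample the next state of one arm from row `state` of its transition matrix P."""
    return rng.choices(range(len(P)), weights=P[state])[0]

def simulate(P_list, rewards, policy, T, seed=0):
    """Simulate T steps of an uncontrolled restless bandit.

    P_list[k]  : transition matrix of arm k (unknown to a real learner)
    rewards[k] : reward of arm k in each of its states
    policy     : hypothetical callable (t, history) -> index of the arm to pull
    """
    rng = random.Random(seed)
    states = [0] * len(P_list)  # assumption: every arm starts in state 0
    total = 0.0
    history = []
    for t in range(T):
        arm = policy(t, history)
        r = rewards[arm][states[arm]]  # only the pulled arm's state is observed
        total += r
        history.append((arm, states[arm], r))
        # "Uncontrolled": every arm transitions at every step,
        # regardless of which arm the player selected.
        states = [step_chain(s, P, rng) for s, P in zip(states, P_list)]
    return total, history

# Two arms whose states flip deterministically at every step.
flip = [[0.0, 1.0], [1.0, 0.0]]
total, history = simulate(
    P_list=[flip, flip],
    rewards=[[1.0, 2.0], [3.0, 4.0]],
    policy=lambda t, history: t % 2,  # round-robin placeholder policy
    T=4,
)
```

Because the player sees only the state of the arm it pulls, the joint state of all arms is partially observed, which is why the abstract relates the problem to POMDPs.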
Similar resources
Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret
In this paper we consider the problem of learning the optimal dynamic policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each of which when played yields a non-negative reward. There is a player who sequentially selects one of the arms at each time step. The goal of the player is to maximize its undiscounted reward over a...
Optimal Policies for a Class of Restless Multiarmed Bandit Scheduling Problems with Applications to Sensor Management
Consider the Markov decision problems (MDPs) arising in the areas of intelligence, surveillance, and reconnaissance in which one selects among different targets for observation so as to track their position and classify them from noisy data [9], [10]; medicine in which one selects among different regimens to treat a patient [1]; and computer network security in which one selects different compu...
Modeling Human Performance in Restless Bandits with Particle Filters
Bandit problems provide an interesting and widely-used setting for the study of sequential decision-making. In their most basic form, bandit problems require people to choose repeatedly between a small number of alternatives, each of which has an unknown rate of providing reward. We investigate restless bandit problems, where the distributions of reward rates for the alternatives change over ti...
The achievable region method in the optimal control of queueing systems; formulations, bounds and policies
We survey a new approach that the author and his co-workers have developed to formulate stochastic control problems (predominantly queueing systems) as mathematical programming problems. The central idea is to characterize the region of achievable performance in a stochastic control problem, i.e., find linear or nonlinear constraints on the performance vectors that all policies satisfy. We pres...